We were given a data set from the National Health and Nutrition Examination Survey (NHANES). The NHANES III data set from 2009-2010 is part of an ongoing, continuous series of surveys of the civilian non-institutionalized population of the United States, published by the CDC every year. It is designed as a surveillance of specific diseases and behaviors, providing statistical insight into the U.S. population (G et al. 2013). According to CDC (2023a), the survey program has assessed the health and nutritional status of adults and children in the United States since the 1960s, combining in-person interviews with physical examinations of participants.
The survey data were not a simple random sample, however. According to the CDC’s National Health and Nutrition Examination Survey: Plan and Operations, 1999–2010 (G et al. 2013), the sampling strategy consists of several stages: 1. Selection of counties as primary sampling units (PSUs). 2. Selection of segments within PSUs that constitute blocks of households. 3. Selection of specific households within segments. 4. Selection of individuals within a household.
In this assignment, we accessed the data set from the aplore3 package in R, which contains 6482 observations of 21 variables. Eight of these variables (gender, marstat, vigwrk, modwrk, wlkbik, vigrecexr, modrecexr, and obese) are treated as categorical, while the rest are numerical. We aim to study the relationship between the weight variable and the other health-related variables in the data.
The weight variable was a continuous random variable in our data. A simple way of categorizing it was to use the BMI indicator. While other indicators such as waist circumference and waist-to-height ratio measure health conditions better (LB et al. 2016), these were not available in the package, so we categorized weight by BMI level. The nhanes data from the package came with the indicator “obese”, which shows whether the participant’s BMI was greater than a threshold of 35, so that obese is a binary random variable. However, a slightly better way of categorizing weight is given by the CDC’s guideline (2022a). According to the CDC’s classification of body weight, we have: BMI \(<\) 18.5 as Underweight, BMI between 18.5 and 24.9 as Healthy weight, BMI between 25 and 29.9 as Overweight, and BMI \(\geq\) 30 as Obesity. In this assignment we explore both ways of categorizing weight, choosing whichever fits the analysis at hand.
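The CDC categorization described above can be sketched in a few lines. The original analysis was carried out in R; the snippet below is only an illustrative Python sketch, the function names `bmi` and `cdc_category` are hypothetical, and it assumes that each category boundary belongs to the heavier class (so a BMI of exactly 25 counts as Overweight).

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 from weight in kilograms and height in centimetres."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def cdc_category(bmi_value):
    """Map a BMI value to one of the four CDC weight categories.

    Assumption: boundaries are treated as half-open intervals,
    e.g. [18.5, 25) is Healthy weight and [25, 30) is Overweight.
    """
    if bmi_value < 18.5:
        return "Underweight"
    if bmi_value < 25.0:
        return "Healthy weight"
    if bmi_value < 30.0:
        return "Overweight"
    return "Obesity"


# Example: a participant at the overall mean weight and height of the sample.
print(cdc_category(bmi(80.4, 167)))
```

Keeping the BMI computation separate from the classification makes it easy to swap in a different set of cut-offs, such as the single threshold behind the binary obese indicator.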
We began our study by doing an exploratory analysis among the variables through various tables and charts. We then performed several hypothesis tests to find the relationship between the weight variable and other variables.
Table 1 below gives an overview of the variables, with weight categorized under the CDC’s approach. For categorical variables, the first four columns present the number and percentage of observations at each BMI level, and the last column gives the totals across those columns and their percentage of all observations. For numerical variables, the first four columns give the mean and standard deviation within each BMI level, and the last column gives the overall mean and standard deviation. The rows show each variable name with its subcategories. If a variable had missing values, an additional row shows their count and percentage. The original data set contained 6482 observations, of which 37 had a missing value for bmi. Because these account for less than 0.6% of the total, we omitted them and took the remaining 6445 observations as the total sample size. However, some variables had a high overall percentage of missing values, the highest being Marital Status at 9.7%. Such a high percentage of missing values could affect an analysis, for example a linear regression model estimating the relationship between weight and the other variables. One could also question whether marital status really had any influence on body weight.
Several numerical variables need explanation, namely the cholesterol levels and the blood pressure. Cholesterol is a type of lipid that helps the body perform important functions. The variable “tchol” is the Total Cholesterol, the total amount of cholesterol circulating in the blood. The variable “hdl” is the HDL-Cholesterol, the high-density lipoprotein cholesterol that helps transfer excess cholesterol from the blood to the liver (2022b). “sysbp” stands for Systolic Blood Pressure and “dbp” stands for Diastolic Blood Pressure. Systolic pressure is the maximum blood pressure when the ventricles contract, while diastolic pressure is the minimum blood pressure registered just before the subsequent contraction (Walker, Hall, and Hurst 1990). The thresholds for cholesterol and blood pressure are discussed later in the analysis. Two other numerical variables in this data set are worth noting, psu and strata. They are the Pseudo-PSU and Pseudo-stratum used in the sampling strategy. In this data set, strata were defined by geography and PSUs were selected from every stratum with probability proportional to a measure of size (PPS). More details about survey weights for NHANES are presented in the extra topic section.
We also had several categorical variables. These include the work and recreational activities, each recorded at two levels of intensity, vigorous and moderate. In the survey design, vigorous activity was defined as intense physical exertion leading to a large increase in respiration or heart rate, sustained for at least 10 minutes continuously during work or recreation. Moderate work and recreational activities were defined as activities that require moderate physical effort and cause a small increase in respiration or heart rate, likewise sustained for at least 10 minutes continuously (2011).
| | Healthy weight | Obesity | Overweight | Underweight | Overall |
|---|---|---|---|---|---|
| | (N=1883) | (N=2311) | (N=2127) | (N=124) | (N=6445) |
| Gender | |||||
| Male | 897 (47.6%) | 1036 (44.8%) | 1171 (55.1%) | 40 (32.3%) | 3144 (48.8%) |
| Female | 986 (52.4%) | 1275 (55.2%) | 956 (44.9%) | 84 (67.7%) | 3301 (51.2%) |
| Age (years) | |||||
| Mean (SD) | 41.2 (± 20.6) | 48.7 (± 17.7) | 48.9 (± 19.0) | 37.9 (± 21.0) | 46.4 (± 19.4) |
| Marital Status | |||||
| Married | 741 (39.4%) | 1158 (50.1%) | 1074 (50.5%) | 31 (25.0%) | 3004 (46.6%) |
| Widowed | 121 (6.4%) | 185 (8.0%) | 190 (8.9%) | 8 (6.5%) | 504 (7.8%) |
| Divorced | 154 (8.2%) | 262 (11.3%) | 210 (9.9%) | 14 (11.3%) | 640 (9.9%) |
| Separated | 47 (2.5%) | 82 (3.5%) | 63 (3.0%) | 1 (0.8%) | 193 (3.0%) |
| Never Married | 351 (18.6%) | 353 (15.3%) | 289 (13.6%) | 30 (24.2%) | 1023 (15.9%) |
| Living Together | 141 (7.5%) | 148 (6.4%) | 157 (7.4%) | 8 (6.5%) | 454 (7.0%) |
| Missing | 328 (17.4%) | 123 (5.3%) | 144 (6.8%) | 32 (25.8%) | 627 (9.7%) |
| Statistical Weight | |||||
| Mean (SD) | 36700 (± 26000) | 33000 (± 25100) | 34200 (± 26300) | 37400 (± 27800) | 34600 (± 25800) |
| Pseudo-PSU | |||||
| Mean (SD) | 1.51 (± 0.500) | 1.50 (± 0.500) | 1.51 (± 0.500) | 1.50 (± 0.502) | 1.51 (± 0.500) |
| Pseudo-stratum | |||||
| Mean (SD) | 7.11 (± 4.09) | 7.36 (± 4.13) | 7.15 (± 4.16) | 7.80 (± 4.14) | 7.22 (± 4.13) |
| Total Cholesterol (mg/dL) | |||||
| Mean (SD) | 185 (± 39.9) | 194 (± 40.5) | 198 (± 42.8) | 172 (± 33.4) | 192 (± 41.4) |
| Missing | 123 (6.5%) | 142 (6.1%) | 121 (5.7%) | 6 (4.8%) | 392 (6.1%) |
| HDL-Cholesterol (mg/dL) | |||||
| Mean (SD) | 58.4 (± 17.1) | 47.6 (± 13.7) | 51.8 (± 15.5) | 63.3 (± 17.1) | 52.5 (± 16.0) |
| Missing | 124 (6.6%) | 142 (6.1%) | 120 (5.6%) | 6 (4.8%) | 392 (6.1%) |
| Systolic Blood Pressure (mm Hg) | |||||
| Mean (SD) | 119 (± 18.5) | 125 (± 17.3) | 125 (± 18.5) | 111 (± 18.5) | 123 (± 18.3) |
| Missing | 164 (8.7%) | 206 (8.9%) | 154 (7.2%) | 20 (16.1%) | 544 (8.4%) |
| Diastolic Blood Pressure (mm Hg) | |||||
| Mean (SD) | 67.4 (± 11.2) | 71.3 (± 12.4) | 69.8 (± 11.8) | 65.7 (± 11.3) | 69.6 (± 11.9) |
| Missing | 167 (8.9%) | 230 (10.0%) | 170 (8.0%) | 18 (14.5%) | 585 (9.1%) |
| Weight (Kg) | |||||
| Mean (SD) | 63.1 (± 9.13) | 99.0 (± 17.7) | 77.3 (± 10.3) | 47.9 (± 5.56) | 80.4 (± 20.2) |
| Standing Height (cm) | |||||
| Mean (SD) | 168 (± 10.0) | 167 (± 10.4) | 168 (± 10.4) | 166 (± 7.59) | 167 (± 10.2) |
| Vigorous Work Activity | |||||
| Yes | 324 (17.2%) | 418 (18.1%) | 371 (17.4%) | 16 (12.9%) | 1129 (17.5%) |
| No | 1558 (82.7%) | 1893 (81.9%) | 1756 (82.6%) | 108 (87.1%) | 5315 (82.5%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Moderate Work Activity | |||||
| Yes | 651 (34.6%) | 796 (34.4%) | 701 (33.0%) | 32 (25.8%) | 2180 (33.8%) |
| No | 1231 (65.4%) | 1515 (65.6%) | 1426 (67.0%) | 92 (74.2%) | 4264 (66.2%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Walk or Bicycle | |||||
| Yes | 630 (33.5%) | 549 (23.8%) | 573 (26.9%) | 48 (38.7%) | 1800 (27.9%) |
| No | 1252 (66.5%) | 1762 (76.2%) | 1554 (73.1%) | 76 (61.3%) | 4644 (72.1%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Vigorous Recreational Activities | |||||
| Yes | 579 (30.7%) | 344 (14.9%) | 449 (21.1%) | 27 (21.8%) | 1399 (21.7%) |
| No | 1303 (69.2%) | 1967 (85.1%) | 1678 (78.9%) | 97 (78.2%) | 5045 (78.3%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Moderate Recreational Activities | |||||
| Yes | 834 (44.3%) | 791 (34.2%) | 823 (38.7%) | 37 (29.8%) | 2485 (38.6%) |
| No | 1048 (55.7%) | 1520 (65.8%) | 1303 (61.3%) | 87 (70.2%) | 3958 (61.4%) |
| Missing | 1 (0.1%) | 0 (0%) | 1 (0.0%) | 0 (0%) | 2 (0.0%) |
| Minutes of Sedentary Activity per Week (mins) | |||||
| Mean (SD) | 316 (± 185) | 333 (± 186) | 308 (± 184) | 366 (± 195) | 321 (± 186) |
| Missing | 17 (0.9%) | 34 (1.5%) | 26 (1.2%) | 1 (0.8%) | 78 (1.2%) |
| Obese | |||||
| No | 1883 (100%) | 1325 (57.3%) | 2127 (100%) | 124 (100%) | 5459 (84.7%) |
| Yes | 0 (0%) | 986 (42.7%) | 0 (0%) | 0 (0%) | 986 (15.3%) |
We began our analysis with the relationship between the weight variable and the Age and Gender variables. This data set mainly covers participants between 16 and 80 years old. Among them, the average weight of males was greater than that of females at all ages, and the line chart shows that average weight changed with age along the same trend for both genders: a sustained increase, followed by fluctuation and finally a continuous decrease. This suggests that there may be some relationship between weight and age.
Figure 3.1: Average Weight in Different Age
To examine the relationship between weight and gender or age, we first categorized the numerical variables. As in Table 1, we used the BMI levels defined by the CDC to categorize weight in the following analysis. For age, we formed three groups based on NIH recommendations: younger than 18 as adolescents, 18 to 65 as adults, and older than 65 as older adults. We then created contingency tables for the variables and used the Pearson chi-squared test and the likelihood ratio test to test independence and to compare the results of the two methods.
| BMI level (rows) vs. age group (columns) | Adolescents | Adults | Older Adults |
|---|---|---|---|
| Underweight | 17 | 86 | 21 |
| Healthy | 188 | 1322 | 343 |
| Overweight | 74 | 1536 | 518 |
| Obesity | 63 | 1765 | 512 |
| BMI level (rows) vs. gender (columns) | Male | Female |
|---|---|---|
| Underweight | 40 | 84 |
| Healthy | 884 | 969 |
| Overweight | 1168 | 960 |
| Obesity | 1052 | 1288 |
| | Weight vs. Age | Weight vs. Gender |
|---|---|---|
| p-value (Pearson chi-squared test) | 2.543e-32 | 6.311e-13 |
| | Weight vs. Age | Weight vs. Gender |
|---|---|---|
| p-value (likelihood ratio test) | 1.976e-29 | 5.224e-13 |
Under the chi-squared test, the p-value for weight and age was \(2.543\times10^{-32}\) and the p-value for weight and gender was \(6.311\times10^{-13}\). The p-values from the likelihood ratio test were slightly different, with \(1.976\times10^{-29}\) for weight and age and \(5.224\times10^{-13}\) for weight and gender. Both methods gave p-values far below 0.001, so we rejected the null hypothesis of independence and concluded that weight is associated with age and with gender. One could further test for a linear relationship between weight and age given gender by fitting a linear model. Because of the trend seen in the line chart, one might fit three linear models, one for each category of the age variable, or consider a polynomial regression of weight on age; neither is pursued further in this assignment.
We then analyzed the relationship between weight and marital status. The following box plot shows that the median weight is around 80 kg for every marital status, with widowed participants having the lowest weight among the six categories. The Married and Never Married categories had more participants heavier than 130 kg than the other categories, but the outliers above 140 kg were rather similar across categories, so this box plot does not show a clear difference in the weight distribution among categories of marital status.
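To make the two test statistics concrete, the sketch below computes the Pearson \(X^2\) and the likelihood ratio \(G^2\) for the weight-by-gender contingency table. It is an illustrative Python re-implementation using only the standard library, not the R code behind the reported results; the helper names `chi2_sf` and `independence_tests` are hypothetical, and the chi-squared tail probability uses the closed-form series that holds for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2-1} (x/2)^i / i!
        return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{i=1}^{(df-1)/2} (x/2)^(i-1/2) / Gamma(i+1/2)
    tail = sum(half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


def independence_tests(table):
    """Return (X2, G2, df, p_X2, p_G2) for a two-way table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # expected count under independence
            x2 += (obs - expected) ** 2 / expected
            if obs > 0:  # empty cells contribute zero to G2
                g2 += 2.0 * obs * math.log(obs / expected)
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return x2, g2, df, chi2_sf(x2, df), chi2_sf(g2, df)


# Rows: Underweight, Healthy, Overweight, Obesity; columns: Male, Female.
gender_table = [[40, 84], [884, 969], [1168, 960], [1052, 1288]]
x2, g2, df, p_x2, p_g2 = independence_tests(gender_table)
print(f"X2={x2:.2f}, G2={g2:.2f}, df={df}, p_X2={p_x2:.3e}, p_G2={p_g2:.3e}")
```

The two statistics share the same expected counts and the same reference distribution, which is why their p-values are close but not identical when the sample is large.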
Figure 3.2: Boxplot of Weight for Different Marital Status
This prompted us to question whether there was really a relationship between the variables, so we tested independence using the chi-squared test. For simplicity we reduced weight to the binary “obese” random variable, defining obesity as BMI level \(\geq 35\), as given in the data set. We formed the contingency table Table 3.6.
| Marital Status | Obese = No | Obese = Yes |
|---|---|---|
| Married | 2530 | 474 |
| Widowed | 418 | 86 |
| Divorced | 528 | 112 |
| Separated | 158 | 35 |
| Never Married | 863 | 160 |
| Living Together | 388 | 66 |
| | marstat |
|---|---|
| p-value | 0.6894 |
Let \(X\) be the categorical random variable for Marital Status and \(Y\) the one for Obesity, and assume a random sample of \(n\) trials. Define the count random variable \(N_{ij}:=\sum_{k=1}^n \mathbf{I}_k(X=i, Y=j)\), where \(\mathbf{I}_k\) is the indicator function for the \(k\)-th trial. The joint random vector \([N_{11}, ..., N_{IJ}]\) then has a Multinomial distribution with \(n\) trials and cell probabilities \(\vec{p}=[p_{11}, ..., p_{IJ}]\). Our hypothesis test is therefore:
\[\begin{gather*} H_0: p_{ij}= p_{i+} \cdot p_{+j} ~ \text{for all } i,j\\ H_1:p_{ij} \neq p_{i+} \cdot p_{+j} ~ \text{for some } i,j \end{gather*}\]
With a p-value of 0.6894, we concluded that there was not enough evidence to reject the null hypothesis. In other words, we could not conclude that there was a relationship between obesity and marital status, which supports our initial impression from the box plot.
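For reference, the Pearson statistic for a test of this kind can be written in the same notation, with \(n_{ij}\) the observed counts and \(n_{i+}\), \(n_{+j}\) the row and column totals. This is the standard textbook form, given here as a sketch rather than as a transcript of the software output:

\[\begin{gather*} \hat{E}_{ij} = \frac{n_{i+}\, n_{+j}}{n}, \qquad X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij}-\hat{E}_{ij})^2}{\hat{E}_{ij}}, \qquad df = (I-1)(J-1) \end{gather*}\]

For the marital status table, \(I=6\) and \(J=2\), so \(df=5\). As one worked cell, the married row total is \(2530+474=3004\), the obese column total is \(933\), and \(n=5818\), giving an expected count of \(3004 \times 933 / 5818 \approx 481.7\) married and obese participants against the 474 observed. Differences this small across cells are consistent with the large p-value.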
Cholesterol is an essential fat in the body. We first carried out an exploratory analysis of the relationship between weight and the cholesterol variables. We adopted the CDC’s categories for weight and plotted the mean levels of total cholesterol and HDL cholesterol in Figure 3.3. We found a slight positive relationship between body weight and total cholesterol, and a negative relationship between HDL and body weight. Because total cholesterol is made up largely of HDL and LDL, this suggests that the obese population has a high level of LDL and a low level of HDL. Our analysis is in line with a more recent study on the effect of BMI on lipid profile in children and adolescents in Saudi Arabia (AA and AE 2019). In this study the researchers concluded that “High BMI was found to be associated with increased levels of LDL cholesterol and decreased levels of HDL cholesterol. No significant association between gender and changes in lipid profile was established (P = 0.898)”.
Figure 3.3: Mean Cholesterol level across Categories of Body Weight
We then grouped the cholesterol and blood pressure measurements into levels. According to American Heart Association ATPIII (Cleeman 2001), we have: tchol \(\geq\) 240 mg/dL OR hdl \(<\) 40 mg/dL as Dangerous cholesterol level; tchol between 200-239 mg/dL OR hdl between 40-59 mg/dL for males (50-59 mg/dL for females) as At Risk level; tchol \(<\) 200 mg/dL and hdl \(\geq\) 60 mg/dL as Healthy level.
Blood pressure is typically categorized into stages based on the systolic (top number) and diastolic (bottom number) readings. According to American Heart Association (2023b), we have: systolic level \(<\) 120 mm Hg and diastolic level \(<\) 80 mm Hg as normal blood pressure; systolic level between 120-129 mm Hg and diastolic level \(<\) 80 mm Hg as elevated blood pressure; systolic level between 130-139 mm Hg OR diastolic level between 80-89 mm Hg as Hypertension Stage 1; systolic level \(\geq\) 140 mm Hg OR diastolic level \(\geq\) 90 mm Hg as Hypertension Stage 2; systolic level \(>\) 180 mm Hg OR diastolic level \(>\) 120 mm Hg as Hypertensive Crisis.
After categorizing cholesterol level and blood pressure, we created three contingency tables, Table 3.8, Table 3.9 and Table 3.10 below.
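Because the blood pressure stages overlap (a reading above 180 mm Hg also satisfies the Stage 2 rule), the order in which the rules are applied matters. The sketch below is an illustrative Python version, with the hypothetical function name `bp_category`, that checks the most severe stage first so that each reading receives exactly one label.

```python
def bp_category(systolic, diastolic):
    """Classify a blood pressure reading (mm Hg), most severe stage first."""
    if systolic > 180 or diastolic > 120:
        return "Hypertensive Crisis"
    if systolic >= 140 or diastolic >= 90:
        return "Hypertension Stage 2"
    if systolic >= 130 or diastolic >= 80:
        return "Hypertension Stage 1"
    if systolic >= 120:
        # Reaching this line implies systolic is 120-129 and diastolic < 80.
        return "Elevated"
    return "Normal"


# Example: a reading close to the overall sample means (123 / 69.6 mm Hg).
print(bp_category(123, 69.6))
```

Checking from the top down avoids writing explicit upper bounds for every stage and guarantees that the categories are mutually exclusive.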
| Weight (rows) vs. cholesterol (columns) | At Risk | Dangerous | Healthy |
|---|---|---|---|
| Healthy | 811 | 294 | 471 |
| Obese | 856 | 776 | 363 |
| Overweight | 884 | 622 | 303 |
| Underweight | 44 | 10 | 46 |
| Weight (rows) vs. blood pressure (columns) | Elevated | Hypertension Stage 1 | Hypertension Stage 2 | Hypertensive Crisis | Normal |
|---|---|---|---|---|---|
| Healthy | 234 | 255 | 200 | 17 | 870 |
| Obese | 330 | 522 | 412 | 10 | 721 |
| Overweight | 309 | 383 | 344 | 24 | 749 |
| Underweight | 10 | 9 | 8 | 1 | 72 |
| Cholesterol (rows) vs. blood pressure (columns) | Elevated | Hypertension Stage 1 | Hypertension Stage 2 | Hypertensive Crisis | Normal |
|---|---|---|---|---|---|
| At Risk | 420 | 587 | 448 | 25 | 1115 |
| Dangerous | 302 | 391 | 338 | 14 | 657 |
| Healthy | 161 | 191 | 178 | 13 | 640 |
| | Weight vs. Cholesterol | Weight vs. Blood Pressure | Cholesterol vs. Blood Pressure |
|---|---|---|---|
| p-value | 1.475e-53 | 5.723e-35 | 3.286e-13 |
We again conducted hypothesis tests using the Pearson chi-squared test. From Table 3.11, the p-value for weight and cholesterol was extremely small, \(1.475\times10^{-53}\), which indicates a statistically significant association between weight and cholesterol levels. For weight against blood pressure, the p-value was \(5.723\times10^{-35}\), which also indicates a significant association. Cholesterol level was likewise associated with blood pressure, with a p-value of \(3.286\times10^{-13}\).
However, there are potential confounding variables such as diet. A further study would need to collect such data and conduct a stratified analysis.
Overall, the results suggested that there was a significant association between weight, cholesterol, and blood pressure. The patterns observed in the contingency tables align with general health knowledge: obesity is a risk factor for both high cholesterol and high blood pressure.
We then analyzed the relationship between body weight and physical activity. Two types of activity measurement were given in the data, work activity and recreational activity, each at the intensity levels vigorous and moderate.
Figure 3.4: Boxplot of BMI for Different Recreational Activity Conditional on Work Activity
From the left plot of Figure 3.4 above one can see that participants with vigorous recreational activities mostly have BMI values within the range of 18.5 to 30, regardless of their work activities. Participants without vigorous recreational activities have systematically higher BMI values, again regardless of their work activities. The right-hand plot shows the same pattern for moderate intensity of work and recreational activities. Comparing the two intensities vertically, there is no marked difference, except that the interquartile range (25% to 75% quantiles) of BMI for vigorous recreational activities lies a little closer to the healthy interval defined by the CDC than that for moderate recreational activities. Define the risk \(p\) as the probability of having a BMI NOT inside the range of 18.5 to 30, i.e. being either underweight or obese. Let \(p_{VigR=Y}\) be the risk for participants with vigorous recreational activities, and \(p_{VigR=N}\) the risk for those without. We also defined a binary random variable called “InRange” to indicate whether the BMI lies within this range. We proposed the following two statements:
\(p_{VigR=Y} < p_{VigR=N}\), perhaps independent of Work Activities
Intensity of recreational activities also plays a role in getting healthy weight
To support our first statement, we first tested the independence of each activity-related variable and the InRange variable. Based on the p-values shown in Table 3.12, we rejected independence between InRange and the wlkbik, vigrecexr and modrecexr variables, and concluded that we did not have enough evidence to reject independence between InRange and the work activity variables.
| | vigwrk | modwrk | wlkbik | vigrecexr | modrecexr |
|---|---|---|---|---|---|
| p-value | 0.645 | 0.8284 | 2.138e-06 | 1.413e-22 | 5.188e-09 |
Then we computed the odds ratios conditional on each level of Vigorous Work Activities, as well as the aggregated (marginal) odds ratio.
| InRange (VigW=Yes) | VigR=Yes | VigR=No |
|---|---|---|
| Yes | 221 | 474 |
| No | 97 | 337 |
| InRange (VigW=No) | VigR=Yes | VigR=No |
|---|---|---|
| Yes | 807 | 2507 |
| No | 274 | 1727 |
| InRange (aggregated over VigW) | VigR=Yes | VigR=No |
|---|---|---|
| Yes | 1028 | 2981 |
| No | 371 | 2064 |
| | Estimated Odds Ratio |
|---|---|
| VigW=Y | 1.619840 |
| VigW=N | 2.028902 |
| Aggregated VigW | 1.918523 |
In all cases the odds ratio is greater than 1, indicating that vigorous recreational activities are associated with lower odds of a risky BMI, whether or not the participant has vigorous work activities. This offers a possible suggestion for those who work long hours without adequate physical activity: recreational activity seems to play the more important role in maintaining a healthy body weight.
We then checked whether the intensity of recreational activities made a difference to the probability of having a body weight within the healthy range. Denote the probability of having a BMI inside the range of 18.5 to 30 by \(p_{VigR}\) for vigorous recreational activities and by \(p_{ModR}\) for moderate recreational activities. We want to test the hypothesis:
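The estimated odds ratios can be checked directly from the three 2x2 tables of InRange against VigR. The snippet below is an illustrative Python sketch (the analysis itself was done in R, and the name `odds_ratio` is hypothetical); it uses the cross-product form \((a \cdot d)/(b \cdot c)\) for a table with rows InRange = Yes/No and columns VigR = Yes/No.

```python
def odds_ratio(table):
    """Sample odds ratio of a 2x2 table [[a, b], [c, d]], computed as (a*d)/(b*c)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Rows: InRange = Yes, No; columns: VigR = Yes, No.
vigw_yes = [[221, 474], [97, 337]]      # participants with vigorous work activity
vigw_no = [[807, 2507], [274, 1727]]    # participants without vigorous work activity

# Summing the two strata cell by cell gives the aggregated table.
aggregated = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(vigw_yes, vigw_no)]

for label, table in [("VigW=Y", vigw_yes), ("VigW=N", vigw_no), ("Aggregated", aggregated)]:
    print(f"{label}: {odds_ratio(table):.6f}")
```

With this orientation, a value above 1 means the odds of being in range are higher with vigorous recreational activity than without it.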
\[\begin{gather*} H_0: p_{VigR} \leq p_{ModR}\\ H_1:p_{VigR} > p_{ModR} \end{gather*}\]
| InRange | Vigorous Recreational=Y | Vigorous Recreational=N | Moderate Recreational=Y | Moderate Recreational=N |
|---|---|---|---|---|
| No | 371 | 2064 | 828 | 1607 |
| Yes | 1028 | 2981 | 1657 | 2351 |
| | VigR_Yes | VigR_No |
|---|---|---|
| p-value | 3.4e-06 | 0.3830068 |
We obtained the maximum likelihood estimates (MLEs) of the parameters from the frequency table, Table 3.16, as:
\[\begin{gather*} p_{VigR=Y} = \frac{1028}{1028+371}, ~ p_{ModR=Y} = \frac{1657}{1657+828} \\ p_{VigR=N} = \frac{2981}{2981+2064}, ~ p_{ModR=N} = \frac{2351}{2351+1607} \\ \end{gather*}\]
Our conclusion was that the probability of having a healthy weight was higher when vigorous recreational activities were done than when moderate recreational activities were done. Among those not doing the respective activities, however, the hypothesis test was inconclusive.
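The text does not state which test produced the p-values, so the following is only one plausible reading: a one-sided Wald z-test with an unpooled standard error, sketched in Python with the hypothetical function name `one_sided_wald_test`. It also treats the two groups as independent samples, which is an assumption, since the same participants can appear under both vigorous and moderate recreational activities.

```python
import math


def one_sided_wald_test(x1, n1, x2, n2):
    """Test H0: p1 <= p2 against H1: p1 > p2 using an unpooled (Wald) z statistic."""
    p1, p2 = x1 / n1, x2 / n2  # sample proportions (the MLEs)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail standard normal probability
    return p1, p2, z, p_value


# In-range counts among those doing vigorous vs. moderate recreational activities.
p_vig, p_mod, z, p_value = one_sided_wald_test(1028, 1028 + 371, 1657, 1657 + 828)
print(f"p_VigR={p_vig:.4f}, p_ModR={p_mod:.4f}, z={z:.3f}, p-value={p_value:.2e}")
```

Under these assumptions the p-value for the "Yes" comparison is of the same order of magnitude as the value reported in the table; a pooled standard error or a continuity correction would give a slightly different number.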
Overall, our analysis showed that vigorous or moderate recreational activities tend to go with a BMI in the healthy range, while moderate or vigorous work activities might not have a significant influence on body weight.
Most large-scale surveys combine several sampling design techniques, such as stratification and cluster sampling, which enable researchers to obtain accurate estimates while respecting practical and cost constraints. Central to these designs is the concept of sampling weights, which are pivotal in ensuring unbiased estimation.
A sampling weight is defined as the inverse of the probability that a specific unit is selected into the sample. These weights adjust for design-imposed inequalities in selection probabilities and are used to compute point estimates.
Consider first stratified random sampling. The population \(U\) of size \(N\) is partitioned into strata denoted by \(U_1,...,U_h,...,U_H\). The size of the \(h\)th stratum is denoted by \(N_h\). In stratum \(h\), a random sample \(S_h\) of size \(n_h\) is selected according to a sampling design; here we use simple random sampling for simplicity and efficiency. In this stratified random sampling, the estimate of the population total can be written as follows (Lohr 2022):
\(\hat{t}_{str}=\sum_{h=1}^H\sum_{j\in S_h}w_{hj}y_{hj}\)
where \(w_{hj}=N_h/n_h\) is the sampling weight of \(y_{hj}\), the \(j\)th observation in the \(h\)th stratum. Note that the probability of selecting the \(j\)th unit in the \(h\)th stratum is \(\pi_{hj}=n_h/N_h\), so the sampling weight is the inverse of this probability \(\pi_{hj}\). The unbiased estimator of the population mean \(\bar{y}_U\) can also be written with sampling weights as follows (Lohr 2022):
\(\hat{\bar{y}}_{str}=\frac{\sum_{h=1}^H\sum_{j\in S_h}w_{hj}y_{hj}}{\sum_{h=1}^H\sum_{j\in S_h}w_{hj}}\)
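A small numerical sketch may help to see how the weights enter the two estimators. The Python code below uses made-up data (two strata of sizes 100 and 50, two sampled units each), and the function name `stratified_estimates` is hypothetical; it assumes simple random sampling within each stratum, so that \(w_{hj}=N_h/n_h\).

```python
def stratified_estimates(strata):
    """Weighted estimates of the population total and mean.

    strata: list of (N_h, sample_values) pairs, one per stratum,
    assuming a simple random sample within each stratum.
    """
    total = 0.0
    weight_sum = 0.0
    for N_h, sample in strata:
        w = N_h / len(sample)  # sampling weight w_hj = N_h / n_h
        total += sum(w * y for y in sample)
        weight_sum += w * len(sample)  # sums to N_h within the stratum
    return total, total / weight_sum


# Hypothetical data: stratum 1 has N_1 = 100, stratum 2 has N_2 = 50.
strata = [(100, [1.0, 3.0]), (50, [10.0, 20.0])]
t_hat, ybar_hat = stratified_estimates(strata)
print(t_hat, ybar_hat)
```

Here each unit in the first stratum stands for 50 population units and each unit in the second for 25, so the sum of all weights recovers the population size of 150.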
Cluster sampling is another complex sampling technique, used in large-scale surveys when the population elements are dispersed and the fieldwork is costly. We sample primary sampling units (psu’s), which are often natural groupings of the population elements. In cluster sampling, \(N\) is the number of psu’s in the population, and the \(i\)th psu contains \(M_i\) elements. For the sample of psu’s, \(S\), we denote by \(n\) the number of psu’s in the sample. In two-stage cluster sampling, a sub-sample \(S_i\) of secondary sampling units (ssu’s) is chosen from the \(i\)th psu, \(i=1,...,n\). The sub-sample size is \(m_i\).
In the case of two-stage cluster sampling with equal probabilities, the sampling weight can be expressed as \(w_{ij}=1/\pi_{ij}=\frac{NM_i}{nm_i}\), where \(\pi_{ij}\) is the probability that the \(j\)th ssu in the \(i\)th psu is in the sample. For unequal probabilities, we need the probability that the \(i\)th psu is in the sample, \(\pi_i\), and the probability that the \(j\)th ssu is in the sample given that the \(i\)th psu is in the sample, \(\pi_{j|i}\). The sampling weight is then given by \(w_{ij}=1/(\pi_i\pi_{j|i})\).
With the sampling weights above, the estimator of the population total in cluster sampling can be written in the following form (Lohr 2022):
\(\hat{t}=\sum_{i\in S}\sum_{j\in S_i}w_{ij}y_{ij}\)
and the estimator of the population mean:
\(\hat{\bar{y}}=\frac{\sum_{i\in S}\sum_{j\in S_i}w_{ij}y_{ij}}{\sum_{i\in S}\sum_{j\in S_i}w_{ij}}\)
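The same idea carries over to two-stage cluster sampling with equal probabilities. The Python sketch below uses made-up numbers (a population of \(N=10\) psu’s, of which \(n=2\) are sampled) and the hypothetical function name `two_stage_estimates`; the weights follow \(w_{ij}=NM_i/(nm_i)\).

```python
def two_stage_estimates(N, psus):
    """Weighted total and mean under equal-probability two-stage cluster sampling.

    N: number of psus in the population.
    psus: list of (M_i, subsample_values) pairs for the sampled psus.
    """
    n = len(psus)
    total = 0.0
    weight_sum = 0.0
    for M_i, subsample in psus:
        m_i = len(subsample)
        w = (N * M_i) / (n * m_i)  # sampling weight w_ij = N * M_i / (n * m_i)
        total += sum(w * y for y in subsample)
        weight_sum += w * m_i
    return total, total / weight_sum


# Hypothetical data: psu 1 has M_1 = 20 elements (4 sampled), psu 2 has M_2 = 30 (3 sampled).
psus = [(20, [1.0, 2.0, 3.0, 4.0]), (30, [2.0, 4.0, 6.0])]
t_hat, ybar_hat = two_stage_estimates(10, psus)
print(t_hat, ybar_hat)
```

In this example the weights are 25 for the first psu and 50 for the second, so units from larger or more thinly sub-sampled clusters count for more in both estimators.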
In essence, sampling weights in complex survey designs play a crucial role in ensuring that survey results are both accurate and generalizable to the broader population.
In our analysis of the NHANES data set, significant associations were identified between body weight and age, gender, cholesterol levels, and blood pressure. Lifestyle factors, especially vigorous recreational activities, were linked to healthier BMI ranges, which underlines their importance in weight management. Marital status did not show a significant relationship with obesity, whereas age and the type of physical activity were associated with weight. However, these findings should be interpreted cautiously because of potential confounding variables not accounted for in the data set.